{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    },
    "gpuClass": "standard",
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [
        "# Customize your own embeddings database\n",
        "\n",
        "txtai supports a number of different database and vector index backends, including external databases. With modern hardware, it's amazing how far a single node index can take us. Easily into the hundreds of millions and even billions of records.\n",
        "\n",
        "txtai provides maximum flexibility in creating your own embeddings database. Sensible defaults are used out of the box. So unless you seek out this configuration, it's not necessary. This notebook will explore the options available when you do want to customize your embeddings database.\n",
        "\n",
        "More on [embeddings configuration settings can be found here](https://neuml.github.io/txtai/embeddings/configuration). "
      ],
      "metadata": {
        "id": "-xU9P9iSR-Cy"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Install dependencies\n",
        "\n",
        "Install `txtai` and all dependencies."
      ],
      "metadata": {
        "id": "shlUi2kKS7KT"
      }
    },
    {
      "cell_type": "code",
      "execution_count": 39,
      "metadata": {
        "id": "xEvX9vCpn4E0"
      },
      "outputs": [],
      "source": [
        "%%capture\n",
        "!pip install git+https://github.com/neuml/txtai#egg=txtai[database,similarity] datasets"
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Load dataset\n",
        "\n",
        "This example will use the `ag_news` dataset, which is a collection of news article headlines. We'll use a subset of 25,000 headlines."
      ],
      "metadata": {
        "id": "408IyXzKFSiG"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import timeit\n",
        "\n",
        "from datasets import load_dataset\n",
        "\n",
        "def timer(embeddings, query=\"red sox\"):\n",
        "  elapsed = timeit.timeit(lambda: embeddings.search(query), number=250)\n",
        "  print(f\"{elapsed / 250} seconds per query\")\n",
        "\n",
        "dataset = load_dataset(\"ag_news\", split=\"train\")[\"text\"][:25000]"
      ],
      "metadata": {
        "id": "IQ_ns6YvFRm1"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# NumPy\n",
        "\n",
        "Let's start with the simplest possible embeddings database. This will just be a thin wrapper around vectorizing text with sentence-transformers, storing the results as a NumPy array and running similarity queries."
      ],
      "metadata": {
        "id": "K15V3Sj_CvG7"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "from txtai.embeddings import Embeddings\n",
        "\n",
        "# Create embeddings instance\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/all-MiniLM-L6-v2\", \"backend\": \"numpy\"})\n",
        "\n",
        "# Index data\n",
        "embeddings.index((x, text, None) for x, text in enumerate(dataset))"
      ],
      "metadata": {
        "id": "DMqiTrTbC-VJ"
      },
      "execution_count": 41,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.search(\"red sox\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "hcAcJikVDMNQ",
        "outputId": "a587620a-5657-4082-84ee-3d73114e1d3a"
      },
      "execution_count": 42,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[(19831, 0.6780003309249878),\n",
              " (18302, 0.6639199256896973),\n",
              " (16370, 0.6617192029953003)]"
            ]
          },
          "metadata": {},
          "execution_count": 42
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.info()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "FQYx76IgMinE",
        "outputId": "9375bf2b-e641-4b01-cc8a-b400c5baf399"
      },
      "execution_count": 43,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "{\n",
            "  \"backend\": \"numpy\",\n",
            "  \"build\": {\n",
            "    \"create\": \"2023-05-16T13:38:32Z\",\n",
            "    \"python\": \"3.10.11\",\n",
            "    \"settings\": {\n",
            "      \"numpy\": \"1.22.4\"\n",
            "    },\n",
            "    \"system\": \"Linux (x86_64)\",\n",
            "    \"txtai\": \"5.6.0\"\n",
            "  },\n",
            "  \"dimensions\": 384,\n",
            "  \"offset\": 25000,\n",
            "  \"path\": \"sentence-transformers/all-MiniLM-L6-v2\",\n",
            "  \"update\": \"2023-05-16T13:38:32Z\"\n",
            "}\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "The embeddings instance above vectorizes the text and stores the content as a NumPy array. Array index positions are returned with similarity scores. While the same can easily be done using sentence-transformers, using the txtai framework makes it easy to swap out different options as seen next."
      ],
      "metadata": {
        "id": "NkHMOoE9L4Nw"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# SQLite and NumPy\n",
        "\n",
        "The next combination we'll test is a SQLite database with a NumPy array."
      ],
      "metadata": {
        "id": "AtEdP7Utw3mk"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Create embeddings instance\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/all-MiniLM-L6-v2\", \"content\": \"sqlite\", \"backend\": \"numpy\"})\n",
        "\n",
        "# Index data\n",
        "embeddings.index((x, text, None) for x, text in enumerate(dataset))"
      ],
      "metadata": {
        "id": "DPWrubv5oOn7"
      },
      "execution_count": 44,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now let's run a search."
      ],
      "metadata": {
        "id": "SDaDLMyXLGe1"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.search(\"red sox\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ILSfWHxVHex0",
        "outputId": "f4cc3f44-da63-4be9-9187-58386fad58df"
      },
      "execution_count": 45,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'id': '19831',\n",
              "  'text': 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',\n",
              "  'score': 0.6780003309249878},\n",
              " {'id': '18302',\n",
              "  'text': 'BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.',\n",
              "  'score': 0.6639199256896973},\n",
              " {'id': '16370',\n",
              "  'text': 'Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.',\n",
              "  'score': 0.6617192029953003}]"
            ]
          },
          "metadata": {},
          "execution_count": 45
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.info()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "0IuVqFxUMwe8",
        "outputId": "fa13d767-d549-4b9c-f63e-7e09b05159ee"
      },
      "execution_count": 46,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "{\n",
            "  \"backend\": \"numpy\",\n",
            "  \"build\": {\n",
            "    \"create\": \"2023-05-16T13:38:52Z\",\n",
            "    \"python\": \"3.10.11\",\n",
            "    \"settings\": {\n",
            "      \"numpy\": \"1.22.4\"\n",
            "    },\n",
            "    \"system\": \"Linux (x86_64)\",\n",
            "    \"txtai\": \"5.6.0\"\n",
            "  },\n",
            "  \"content\": \"sqlite\",\n",
            "  \"dimensions\": 384,\n",
            "  \"offset\": 25000,\n",
            "  \"path\": \"sentence-transformers/all-MiniLM-L6-v2\",\n",
            "  \"update\": \"2023-05-16T13:38:52Z\"\n",
            "}\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Same results as before. The only difference is the content is now available via the associated SQLite database. \n",
        "\n",
        "Let's inspect the ANN object to see how it looks. "
      ],
      "metadata": {
        "id": "B_XnpIpXNKSP"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(embeddings.ann.backend.shape)\n",
        "print(type(embeddings.ann.backend))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "c2FVQlxSLKgP",
        "outputId": "1b8e454d-c5f7-4d58-b2b0-a1a9eaca4b9b"
      },
      "execution_count": 47,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "(25000, 384)\n",
            "<class 'numpy.memmap'>\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "As expected, it's a NumPy array. Let's calculate how long a search query takes to execute.\n"
      ],
      "metadata": {
        "id": "00dnum6fNNM0"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "timer(embeddings)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "JvodDi4w6JxS",
        "outputId": "3671753d-0dd1-47fe-bef6-04841204cb6b"
      },
      "execution_count": 48,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.028768999292000445 seconds per query\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Not too bad at all!\n",
        "\n"
      ],
      "metadata": {
        "id": "eqom3l_87jFv"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# SQLite and PyTorch\n",
        "\n",
        "Let's now try a PyTorch backend."
      ],
      "metadata": {
        "id": "Y54lSbQd5rzy"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Create embeddings instance\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/all-MiniLM-L6-v2\", \"content\": \"sqlite\", \"backend\": \"torch\"})\n",
        "\n",
        "# Index data\n",
        "embeddings.index((x, text, None) for x, text in enumerate(dataset))"
      ],
      "metadata": {
        "id": "OYAqPoTmNaNN"
      },
      "execution_count": 49,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Let's run a search again."
      ],
      "metadata": {
        "id": "DT52loQU7zmt"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.search(\"red sox\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "zvlrEunM7vi4",
        "outputId": "e591495f-0622-4291-fbbd-cbe655f92a07"
      },
      "execution_count": 50,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'id': '19831',\n",
              "  'text': 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',\n",
              "  'score': 0.678000271320343},\n",
              " {'id': '18302',\n",
              "  'text': 'BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.',\n",
              "  'score': 0.6639199256896973},\n",
              " {'id': '16370',\n",
              "  'text': 'Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.',\n",
              "  'score': 0.6617191433906555}]"
            ]
          },
          "metadata": {},
          "execution_count": 50
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.info()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "QmJZb56SM6Up",
        "outputId": "a6e9b73a-6248-43ca-a57f-c3a4fec9112f"
      },
      "execution_count": 51,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "{\n",
            "  \"backend\": \"torch\",\n",
            "  \"build\": {\n",
            "    \"create\": \"2023-05-16T13:39:19Z\",\n",
            "    \"python\": \"3.10.11\",\n",
            "    \"settings\": {\n",
            "      \"torch\": \"2.0.0+cu118\"\n",
            "    },\n",
            "    \"system\": \"Linux (x86_64)\",\n",
            "    \"txtai\": \"5.6.0\"\n",
            "  },\n",
            "  \"content\": \"sqlite\",\n",
            "  \"dimensions\": 384,\n",
            "  \"offset\": 25000,\n",
            "  \"path\": \"sentence-transformers/all-MiniLM-L6-v2\",\n",
            "  \"update\": \"2023-05-16T13:39:19Z\"\n",
            "}\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "And once against inspect the ANN object."
      ],
      "metadata": {
        "id": "jqdMjDiO8Dy3"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "print(embeddings.ann.backend.shape)\n",
        "print(type(embeddings.ann.backend))"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "u5JFEJ-q5Zow",
        "outputId": "fffb3e3d-2d5c-4f20-877a-cce52836e5f3"
      },
      "execution_count": 52,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "torch.Size([25000, 384])\n",
            "<class 'torch.Tensor'>\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "As expected, this time the backend is a Torch tensor. Next we'll calculate the average search time."
      ],
      "metadata": {
        "id": "jGwHvEHE6ALO"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "timer(embeddings)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "7aAI_jUm6goL",
        "outputId": "e4ec71dc-c1d8-40ff-9e68-113a5da1ad1e"
      },
      "execution_count": 53,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.0198183048359997 seconds per query\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "A bit faster since Torch uses the GPU to compute the similarity matrix."
      ],
      "metadata": {
        "id": "mp3nLHz38OIp"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# SQLite and Faiss\n",
        "\n",
        "Now lets run the same code with the standard txtai settings of Faiss + SQLite."
      ],
      "metadata": {
        "id": "8h3SXoGr9YIL"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Create embeddings instance\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/all-MiniLM-L6-v2\", \"content\": True})\n",
        "\n",
        "# Index data\n",
        "embeddings.index((x, text, None) for x, text in enumerate(dataset))"
      ],
      "metadata": {
        "id": "DQECU7y-9doj"
      },
      "execution_count": 54,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.search(\"red sox\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "wGqcmE6M9kJW",
        "outputId": "25c7ef4c-a437-4d64-ebaf-1481d0873ca1"
      },
      "execution_count": 55,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'id': '19831',\n",
              "  'text': 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',\n",
              "  'score': 0.6780003309249878},\n",
              " {'id': '18302',\n",
              "  'text': 'BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.',\n",
              "  'score': 0.6639199256896973},\n",
              " {'id': '16370',\n",
              "  'text': 'Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.',\n",
              "  'score': 0.6617192029953003}]"
            ]
          },
          "metadata": {},
          "execution_count": 55
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.info()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "zDxqlMT9d-Q3",
        "outputId": "befdb0a1-01b4-4948-94e2-4570efada3ce"
      },
      "execution_count": 56,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "{\n",
            "  \"backend\": \"faiss\",\n",
            "  \"build\": {\n",
            "    \"create\": \"2023-05-16T13:39:47Z\",\n",
            "    \"python\": \"3.10.11\",\n",
            "    \"settings\": {\n",
            "      \"components\": \"IVF632,Flat\"\n",
            "    },\n",
            "    \"system\": \"Linux (x86_64)\",\n",
            "    \"txtai\": \"5.6.0\"\n",
            "  },\n",
            "  \"content\": true,\n",
            "  \"dimensions\": 384,\n",
            "  \"offset\": 25000,\n",
            "  \"path\": \"sentence-transformers/all-MiniLM-L6-v2\",\n",
            "  \"update\": \"2023-05-16T13:39:47Z\"\n",
            "}\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "timer(embeddings)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "zlbc43qg9qKb",
        "outputId": "ae06092c-ff5f-4540-afd5-e054db6ffb84"
      },
      "execution_count": 57,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.00825659705599992 seconds per query\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Everything lines up with the previous examples. Note that Faiss is faster, given it's a vector index. For 25,000 records, the different is negligible but vector index performance increases rapidly for datasets in the million+ range."
      ],
      "metadata": {
        "id": "j5u1GEbV91GH"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# SQLite and HNSW\n",
        "\n",
        "While txtai strives to keep things as simple as possible with many common default settings out of the box, customizing the backend options can lead to increased performance. The next example will store vectors in a HNSW index and customize the index options."
      ],
      "metadata": {
        "id": "f4Hnjfy--ye0"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Create embeddings instance\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/all-MiniLM-L6-v2\", \"content\": True, \"backend\": \"hnsw\", \"hnsw\": {\"m\": 32}})\n",
        "\n",
        "# Index data\n",
        "embeddings.index((x, text, None) for x, text in enumerate(dataset))"
      ],
      "metadata": {
        "id": "5dqxj2hr_ICl"
      },
      "execution_count": 58,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.search(\"red sox\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "pUw-3WCHFGf9",
        "outputId": "22e827a9-ed34-4847-b41b-70499f85cff0"
      },
      "execution_count": 59,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'id': '19831',\n",
              "  'text': 'Boston Red Sox Team Report - September 6 (Sports Network) - Two of the top teams in the American League tangle in a possible American League Division Series preview tonight, as the West-leading Oakland Athletics host the wild card-leading Boston Red Sox for the first of a three-game set at the ',\n",
              "  'score': 0.678000271320343},\n",
              " {'id': '18302',\n",
              "  'text': 'BASEBALL: RED-HOT SOX CLIP THE ANGELS #39; WINGS BOSTON RED SOX fans are enjoying their best week of the season. While their beloved team swept wild-card rivals Anaheim in a three-game series to establish a nine-game winning streak, the hated New York Yankees endured the heaviest loss in their history.',\n",
              "  'score': 0.6639199256896973},\n",
              " {'id': '16370',\n",
              "  'text': 'Boston Red Sox Team Report - September 1 (Sports Network) - The red-hot Boston Red Sox hope to continue rolling as they continue their three-game set with the Anaheim Angels this evening at Fenway Park.',\n",
              "  'score': 0.6617191433906555}]"
            ]
          },
          "metadata": {},
          "execution_count": 59
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "embeddings.info()"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "u5DAfB1MeCLF",
        "outputId": "d00dec7f-e1da-49f1-afd3-aa852238769f"
      },
      "execution_count": 60,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "{\n",
            "  \"backend\": \"hnsw\",\n",
            "  \"build\": {\n",
            "    \"create\": \"2023-05-16T13:40:21Z\",\n",
            "    \"python\": \"3.10.11\",\n",
            "    \"settings\": {\n",
            "      \"efconstruction\": 200,\n",
            "      \"m\": 32,\n",
            "      \"seed\": 100\n",
            "    },\n",
            "    \"system\": \"Linux (x86_64)\",\n",
            "    \"txtai\": \"5.6.0\"\n",
            "  },\n",
            "  \"content\": true,\n",
            "  \"deletes\": 0,\n",
            "  \"dimensions\": 384,\n",
            "  \"hnsw\": {\n",
            "    \"m\": 32\n",
            "  },\n",
            "  \"metric\": \"ip\",\n",
            "  \"offset\": 25000,\n",
            "  \"path\": \"sentence-transformers/all-MiniLM-L6-v2\",\n",
            "  \"update\": \"2023-05-16T13:40:21Z\"\n",
            "}\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "timer(embeddings)"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "Z3Qg6EEjFIom",
        "outputId": "49d01a91-5d53-4792-8244-316b6e79cc20"
      },
      "execution_count": 61,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.006280380824000531 seconds per query\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Once again, everything matches up with the previous examples. There is a negligible performance difference vs Faiss.\n",
        "\n",
        "Hnswlib powers a number of popular vector databases. It's definitely an option worth evaluating."
      ],
      "metadata": {
        "id": "JREMWY5NHAX-"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# External Vectorization\n",
        "\n",
        "txtai has a number of built-in vectorizers backed by Hugging Face Transformers and Sentence Transformers. Just like other txtai modules, vectorization can also be customized.\n",
        "\n",
        "The next example uses the Hugging Face Inference API to vectorize text.\n"
      ],
      "metadata": {
        "id": "Wuj_gZeBs57O"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import numpy as np\n",
        "import requests\n",
        "\n",
        "BASE = \"https://api-inference.huggingface.co/pipeline/feature-extraction\"\n",
        "\n",
        "def transform(inputs):\n",
        "  # Your API provider of choice\n",
        "  response = requests.post(f\"{BASE}/sentence-transformers/all-MiniLM-L6-v2\", json={\"inputs\": inputs})\n",
        "  return np.array(response.json(), dtype=np.float32)\n",
        "\n",
        "embeddings = Embeddings({\"transform\": transform, \"backend\": \"numpy\", \"content\": True})\n",
        "embeddings.index([(0, \"sunny\", None), (1, \"rainy\", None)])\n",
        "embeddings.search(\"nice day\")  "
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "FAhg1TcdtWIJ",
        "outputId": "22e34bd0-b6d3-40b2-8948-d08397d744f5"
      },
      "execution_count": 62,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "[{'id': '0', 'text': 'sunny', 'score': 0.28077083826065063},\n",
              " {'id': '1', 'text': 'rainy', 'score': 0.18051263689994812}]"
            ]
          },
          "metadata": {},
          "execution_count": 62
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Configuration storage\n",
        "\n",
        "Configuration is passed to an embeddings instance as a dictionary. When saving an embeddings instance, the default behavior is to save configuration as a pickled object. JSON can alternatively be used."
      ],
      "metadata": {
        "id": "RvHkAloSl4y3"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Create embeddings instance\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/all-MiniLM-L6-v2\", \"content\": True, \"format\": \"json\"})\n",
        "\n",
        "# Index data\n",
        "embeddings.index((x, text, None) for x, text in enumerate(dataset))\n",
        "\n",
        "# Save embeddings\n",
        "embeddings.save(\"index\")\n",
        "\n",
        "!cat index/config.json"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "X1DYuyPmmSgU",
        "outputId": "975c0f7f-f38e-478e-a8b8-16155f1f4865"
      },
      "execution_count": 63,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "{\n",
            "  \"path\": \"sentence-transformers/all-MiniLM-L6-v2\",\n",
            "  \"content\": true,\n",
            "  \"format\": \"json\",\n",
            "  \"dimensions\": 384,\n",
            "  \"backend\": \"faiss\",\n",
            "  \"offset\": 25000,\n",
            "  \"build\": {\n",
            "    \"create\": \"2023-05-16T13:40:49Z\",\n",
            "    \"python\": \"3.10.11\",\n",
            "    \"settings\": {\n",
            "      \"components\": \"IVF632,Flat\"\n",
            "    },\n",
            "    \"system\": \"Linux (x86_64)\",\n",
            "    \"txtai\": \"5.6.0\"\n",
            "  },\n",
            "  \"update\": \"2023-05-16T13:40:49Z\"\n",
            "}"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "Looking at the stored configuration, it's almost identical to an `embeddings.info()` call. This is by design, JSON configuration is designed to be human-readable. This is a good option when sharing an embeddings database on the [Hugging Face Hub](https://huggingface.co/models)."
      ],
      "metadata": {
        "id": "ETdcrP7dqI8J"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# SQLite vs DuckDB\n",
        "\n",
        "The last thing we'll explore is the database backend.\n",
        "\n",
        "[SQLite](https://sqlite.org/index.html) is a row-oriented database, [DuckDB](https://duckdb.org/) is column-oriented. This design difference is important to note and a factor to consider when evaluating the expected workload. Let's explore."
      ],
      "metadata": {
        "id": "z6zmhGRVHawG"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# Create embeddings instance\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/all-MiniLM-L6-v2\", \"content\": \"sqlite\"})\n",
        "\n",
        "# Index data\n",
        "embeddings.index((x, text, None) for x, text in enumerate(dataset))"
      ],
      "metadata": {
        "id": "KZ-x_53SHsNK"
      },
      "execution_count": 64,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "timer(embeddings, \"SELECT text FROM txtai where id = 3980\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "LQZCGu-9H70K",
        "outputId": "06e41f55-687d-4d98-e194-13efdd2991ef"
      },
      "execution_count": 65,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.00012401376399975562 seconds per query\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "timer(embeddings, \"SELECT count(*), text FROM txtai group by text order by count(*) desc\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "SwgR2TwPHvdP",
        "outputId": "d88a457a-7789-4881-9420-914e2992d132"
      },
      "execution_count": 66,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.03863514600000053 seconds per query\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "# Create embeddings instance\n",
        "embeddings = Embeddings({\"path\": \"sentence-transformers/all-MiniLM-L6-v2\", \"content\": \"duckdb\"})\n",
        "\n",
        "# Index data\n",
        "embeddings.index((x, text, None) for x, text in enumerate(dataset))"
      ],
      "metadata": {
        "id": "ZdrLBOmaIKbF"
      },
      "execution_count": 67,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "timer(embeddings, \"SELECT text FROM txtai where id = 3980\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "cce1PAVqIMpU",
        "outputId": "51b8749b-7a87-42d0-af17-2710f8460c68"
      },
      "execution_count": 68,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.0038918176440001844 seconds per query\n"
          ]
        }
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "timer(embeddings, \"SELECT count(*), text FROM txtai group by text order by count(*) desc\")"
      ],
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "lxBfoh3TINmE",
        "outputId": "f56ed46b-abb6-4f6f-e986-10ec570e8cbe"
      },
      "execution_count": 69,
      "outputs": [
        {
          "output_type": "stream",
          "name": "stdout",
          "text": [
            "0.0198518766039997 seconds per query\n"
          ]
        }
      ]
    },
    {
      "cell_type": "markdown",
      "source": [
        "While the dataset of 25,000 rows is small, we can start to see the differences. SQLite has a much faster single row retrieval time. DuckDB does better with an aggregate query. This is a product of a row-oriented vs column oriented database and a factor to consider when developing a solution."
      ],
      "metadata": {
        "id": "_hBm-yZTJtUQ"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Wrapping up\n",
        "\n",
        "This notebook explored different combinations of database and vector index backends. With modern hardware, it's amazing how far a single node index can take us. Easily into the hundreds of millions and even billions of records. When a hardware bottleneck becomes an issue, external vector databases are one option to consider. Another is [building a distributed txtai embeddings cluster](https://neuml.github.io/txtai/api/cluster/).\n",
        "\n",
        "There is power in simplicity. Many paid services try to convince us that signing up for an API account is the best place to start. In some cases, such as teams with very few to no developers, this is true. But for teams with developers, options like txtai should be evaluated."
      ],
      "metadata": {
        "id": "4L8smyyXc8q8"
      }
    }
  ]
}